What is claimed is: 

1. A method of detecting a process failure in a distributed system, the method comprising
steps of: 

(1) measuring a first period of time between an instance a last heartbeat was received 
from a first process and a later instance in time; 

(2) measuring a second period of time between an instance a last heartbeat was received 
from a second process and said later instance in time; 

(3) comparing said first and second periods of time with a predetermined threshold; and 

(4) determining whether a process failure occurred in response to said comparison in step
(3).

2. The method of claim 1, wherein step (3) further comprises steps of: 

calculating a difference between said first period of time and said second period of time; and

comparing said difference to said predetermined threshold.

3. The method of claim 2, wherein step (4) further comprises steps of: 

detecting a failure of said second process in response to said difference exceeding said 
predetermined threshold. 

4. The method of claim 1, wherein said steps are performed as computer-executable 
instructions on a computer-readable medium. 

5. The method of claim 1, wherein said distributed system includes one network. 
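
The comparison recited in claims 1 to 3 can be illustrated with a minimal sketch. It is not taken from the specification: the function name, the parameter names, and the sign convention (second period minus first period) are assumptions made for illustration only.

```python
def detect_process_failure(last_heartbeat_first, last_heartbeat_second, now, threshold):
    """Return True if the second process is judged to have failed.

    All arguments are times or durations in the same unit (e.g. seconds).
    The names and the sign of the difference are illustrative assumptions.
    """
    # Step (1): time elapsed since the last heartbeat from the first process.
    first_period = now - last_heartbeat_first
    # Step (2): time elapsed since the last heartbeat from the second process.
    second_period = now - last_heartbeat_second
    # Claim 2: difference between the two periods (assumed: second minus first).
    difference = second_period - first_period
    # Claim 3: the second process is deemed failed when the difference
    # exceeds the predetermined threshold.
    return difference > threshold


# The first process was heard 1 s ago, the second 11 s ago.
print(detect_process_failure(100.0, 90.0, 101.0, threshold=5.0))
```

Under these assumptions, two processes that have both been silent for the same length of time yield a difference near zero, so neither is declared failed by this test alone.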

6. A method of detecting a network failure in a distributed system, the method comprising 
steps of: 

(1) determining whether a heartbeat is received from at least one process in the
distributed system prior to an expiration of a heartbeat timeout; and

(2) detecting a failure of a network in said system in response to not receiving said
heartbeat from said at least one process prior to said expiration of said heartbeat timeout.

7. The method of claim 6, wherein said steps are performed as computer-executable
instructions on a computer-readable medium.

8. The method of claim 6, wherein said distributed system includes one network.
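
The timeout test of claim 6 might be sketched as follows. This is an illustrative reading, not the claimed implementation: the function name, the dictionary of last-heartbeat times, and the treatment of an empty set of monitored processes are all assumptions.

```python
def detect_network_failure(last_heartbeats, now, heartbeat_timeout):
    """Return True if a network failure is suspected.

    last_heartbeats maps a (hypothetical) process identifier to the time its
    most recent heartbeat was received.
    """
    # Assumption: with nothing to monitor, no failure can be inferred.
    if not last_heartbeats:
        return False
    # Step (1): check whether any process delivered a heartbeat before the
    # heartbeat timeout expired.
    # Step (2): if none did, a failure of the network is detected.
    return all(now - received >= heartbeat_timeout
               for received in last_heartbeats.values())


# Neither process has been heard from within the 5 s timeout.
print(detect_network_failure({"p2": 0.0, "p3": 1.0}, now=10.0, heartbeat_timeout=5.0))
```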

9. A distributed system including a plurality of hosts connected via a network, wherein each
host executes a process in said distributed system, said system comprising:

a first host of said plurality of hosts executing a first process; wherein said first host is
operable to detect one of failure of a second process executing on a second host and failure of
said network based on a period of time to receive a heartbeat transmitted from at least one of
said plurality of hosts.

10. The system of claim 9, further comprising:

a third host of said plurality of hosts executing a third process; wherein said first host is
operable to measure a first period of time between an instance a last heartbeat was received from
said third host on said network and a later instance in time and measure a second period of time
between an instance a last heartbeat was received from said second host and said later instance in
time;

said first host being further operable to compare said first and second periods of time
with a predetermined threshold, and detect a failure of said second process in response to said
comparison.

11. The system of claim 10, wherein said first host is further operable to calculate a
difference between said first period of time and said second period of time, and compare said
difference to said predetermined threshold.

12. The system of claim 11, wherein said first host is operable to detect said failure of said
second process in response to said difference exceeding said predetermined threshold.

13. The system of claim 12, wherein said first process is operable to remove said second
process from a view in response to detecting said failure of said second process.

14. The system of claim 9, wherein said first host is operable to determine whether a
heartbeat is received from at least one other host in said system prior to an expiration of a
heartbeat timeout.

15. The system of claim 14, wherein said first host is further operable to detect said failure of
said network in response to not receiving a heartbeat from said at least one other host prior to
said expiration of said heartbeat timeout.
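
One way the first host of claims 9 to 15 could combine the two tests is sketched below. The class, its method names, the returned status strings, and the choice to compare each host against the most recently heard host are hypothetical and are not drawn from the specification.

```python
class FirstHostMonitor:
    """Illustrative monitor run by the first host (all names are assumptions)."""

    def __init__(self, peers, threshold, heartbeat_timeout, start=0.0):
        self.view = set(peers)                      # hosts currently in the view
        self.threshold = threshold                  # predetermined threshold
        self.heartbeat_timeout = heartbeat_timeout  # heartbeat timeout
        self.last_heartbeat = {p: start for p in peers}

    def on_heartbeat(self, peer, now):
        # Record the time of the last heartbeat received from a host in the view.
        if peer in self.view:
            self.last_heartbeat[peer] = now

    def check(self, now):
        # Period of time since the last heartbeat, per host (claim 10).
        ages = {p: now - self.last_heartbeat[p] for p in self.view}
        if not ages:
            return "ok", set()
        freshest = min(ages.values())
        # Claims 14 and 15: no heartbeat from any other host before the
        # timeout expired, so a network failure is detected.
        if freshest >= self.heartbeat_timeout:
            return "network_failure", set()
        # Claims 11 and 12: a host whose period exceeds the freshest period
        # by more than the threshold is treated as a failed process.
        failed = {p for p, age in ages.items() if age - freshest > self.threshold}
        # Claim 13: remove failed processes from the view.
        self.view -= failed
        for p in failed:
            del self.last_heartbeat[p]
        return ("process_failure" if failed else "ok"), failed


monitor = FirstHostMonitor(["h2", "h3"], threshold=5.0, heartbeat_timeout=20.0)
monitor.on_heartbeat("h2", 1.0)
monitor.on_heartbeat("h3", 9.0)
print(monitor.check(10.0))
```

In this sketch a silent network is distinguished from a silent process by checking the timeout first: only when at least one host is still being heard is the difference test applied.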



HP Docket No.: 10010268-1 



